18 research outputs found
Stability and Expressiveness of Deep Generative Models
In recent years, deep learning has revolutionized both machine learning and computer vision. Many classical computer vision tasks (e.g. object detection and semantic segmentation), which traditionally were very challenging, can now be solved using supervised deep learning techniques. While supervised learning is a powerful tool when labeled data is available and the task under consideration has a well-defined output, these conditions are not always satisfied. One promising approach in this case is given by generative modeling. In contrast to purely discriminative models, generative models can deal with uncertainty and learn powerful models even when labeled training data is not available. However, while current approaches to generative modeling achieve promising results, they suffer from two aspects that limit their expressiveness: (i) some of the most successful approaches to modeling image data are no longer trained using optimization algorithms, but instead employ algorithms whose dynamics are not well understood and (ii) generative models are often limited by the memory requirements of the output representation. We address both problems in this thesis: in the first part we introduce a theory which enables us to better understand the training dynamics of Generative Adversarial Networks (GANs), one of the most promising approaches to generative modeling.
We tackle this problem by introducing minimal example problems of GAN training which can be understood analytically. Subsequently, we gradually increase the complexity of these examples. By doing so, we gain new insights into the training dynamics of GANs and derive new regularizers that also work well for general GANs. Our new regularizers enable us, for the first time, to train a GAN at one megapixel resolution without having to gradually increase the resolution of the training distribution. In the second part of this thesis we consider output representations in 3D for generative models and 3D reconstruction techniques. By introducing implicit representations to deep learning, we are able to extend many techniques that work in 2D to the 3D domain without sacrificing their expressiveness.
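The simplest of the minimal example problems mentioned above is the Dirac-GAN: a generator that produces a single Dirac peak at position theta, and a linear discriminator D(x) = psi * x. The following numpy sketch simulates simultaneous gradient descent/ascent on this problem; the objective f(t) = -log(1 + exp(-t)), the step size, and the penalty weight gamma are illustrative choices, not values prescribed by the thesis.

```python
import numpy as np

def f_prime(t):
    """Derivative of f(t) = -log(1 + exp(-t)); clipped for numerical stability."""
    return 1.0 / (1.0 + np.exp(np.clip(t, -30.0, 30.0)))

def simulate(gamma, steps=2000, h=0.1):
    """Simultaneous gradient ascent/descent on the Dirac-GAN.

    theta: generator parameter (position of the Dirac peak),
    psi:   slope of the linear discriminator D(x) = psi * x,
    gamma: weight of the zero-centered gradient penalty on the
           discriminator (gamma = 0 means unregularized training).
    """
    theta, psi = 1.0, 1.0
    for _ in range(steps):
        g = f_prime(theta * psi)
        d_theta = -psi * g                # generator descends
        d_psi = theta * g - gamma * psi   # discriminator ascends, penalized
        theta, psi = theta + h * d_theta, psi + h * d_psi
    return theta, psi

theta_u, psi_u = simulate(gamma=0.0)  # unregularized: does not converge
theta_r, psi_r = simulate(gamma=1.0)  # regularized: approaches (0, 0)
```

The equilibrium is (theta, psi) = (0, 0): the unregularized iterates spiral away from it, while the zero-centered penalty on psi damps the oscillation and the iterates converge.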
Augmented Reality Meets Computer Vision: Efficient Data Generation for Urban Driving Scenes
The success of deep learning in computer vision is based on the availability of
large annotated datasets. To lower the need for hand-labeled images, virtually
rendered 3D worlds have recently gained popularity. Creating realistic 3D
content is challenging on its own and requires significant human effort. In
this work, we propose an alternative paradigm which combines real and synthetic
data for learning semantic instance segmentation and object detection models.
Exploiting the fact that not all aspects of the scene are equally important for
this task, we propose to augment real-world imagery with virtual objects of the
target category. Capturing real-world images at large scale is easy and cheap,
and directly provides real background appearances without the need for creating
complex 3D models of the environment. We present an efficient procedure to
augment real images with virtual objects. This allows us to create realistic
composite images which exhibit both realistic background appearance and a large
number of complex object arrangements. In contrast to modeling complete 3D
environments, our augmentation approach requires only a few user interactions
in combination with 3D shapes of the target object. Through extensive
experimentation, we determine the right set of parameters to produce augmented
data which can maximally enhance the performance of instance segmentation
models. Further, we demonstrate the utility of our approach on training
standard deep models for semantic instance segmentation and object detection of
cars in outdoor driving scenes. We test the models trained on our augmented
data on the KITTI 2015 dataset, which we have annotated with pixel-accurate
ground truth, and on the Cityscapes dataset. Our experiments demonstrate that
models trained on augmented imagery generalize better than those trained on
synthetic data or on a limited amount of annotated real data.
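At its core, the augmentation step composites a rendered virtual object over a real photograph using the object's alpha mask. The sketch below is a minimal illustration of that compositing operation on toy arrays, not the authors' actual pipeline; array shapes and value ranges are assumptions.

```python
import numpy as np

def composite(background, foreground, alpha):
    """Alpha-blend a rendered foreground object over a real background.

    background: (H, W, 3) float array in [0, 1], real photograph
    foreground: (H, W, 3) float array in [0, 1], rendered object colors
    alpha:      (H, W, 1) float array in [0, 1], rendered object mask
    """
    return alpha * foreground + (1.0 - alpha) * background

# Toy example: a 2x2 "photo" with the virtual object covering one pixel.
bg = np.zeros((2, 2, 3))                 # black background
fg = np.ones((2, 2, 3))                  # white rendered object
a = np.zeros((2, 2, 1))
a[0, 0, 0] = 1.0                         # object occupies the top-left pixel
out = composite(bg, fg, a)
```

Pixels where alpha is 1 take the rendered object's color; everywhere else the real background shows through, which is what lets the approach reuse real background appearance for free.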
Which Training Methods for GANs do actually Converge?
Recent work has shown local convergence of GAN training for absolutely
continuous data and generator distributions. In this paper, we show that the
requirement of absolute continuity is necessary: we describe a simple yet
prototypical counterexample showing that in the more realistic case of
distributions that are not absolutely continuous, unregularized GAN training is
not always convergent. Furthermore, we discuss regularization strategies that
were recently proposed to stabilize GAN training. Our analysis shows that GAN
training with instance noise or zero-centered gradient penalties converges. On
the other hand, we show that Wasserstein-GANs and WGAN-GP with a finite number
of discriminator updates per generator update do not always converge to the
equilibrium point. We discuss these results, leading us to a new explanation
for the stability problems of GAN training. Based on our analysis, we extend
our convergence results to more general GANs and prove local convergence for
simplified gradient penalties even if the generator and data distribution lie
on lower dimensional manifolds. We find these penalties to work well in
practice and use them to learn high-resolution generative image models for a
variety of datasets with little hyperparameter tuning.
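The zero-centered gradient penalties discussed above regularize the discriminator with a term of the form gamma/2 * E[||grad_x D(x)||^2], evaluated on (for instance) real samples. The numpy sketch below computes this penalty for a tiny, hypothetical one-hidden-layer discriminator (not the networks used in the paper), with the input gradient derived analytically.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical discriminator D(x) = w2 . tanh(W1 x + b1) + b2 on 3-D inputs.
W1, b1 = rng.normal(size=(8, 3)), rng.normal(size=8)
w2, b2 = rng.normal(size=8), 0.0

def D(x):
    """Discriminator logit for a single input x of shape (3,)."""
    return w2 @ np.tanh(W1 @ x + b1) + b2

def grad_x(x):
    """Analytic gradient of D with respect to its input x."""
    h = np.tanh(W1 @ x + b1)
    return W1.T @ (w2 * (1.0 - h ** 2))  # chain rule through tanh

def r1_penalty(xs, gamma=10.0):
    """Zero-centered gradient penalty gamma/2 * E[||grad_x D(x)||^2],
    evaluated on a batch of samples xs of shape (N, 3)."""
    sq_norms = np.array([np.sum(grad_x(x) ** 2) for x in xs])
    return 0.5 * gamma * sq_norms.mean()
```

In a deep learning framework the input gradient would come from automatic differentiation; the analytic form here makes the penalty easy to check against finite differences.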
Learning Neural Light Transport
In recent years, deep generative models have gained significance due to their
ability to synthesize natural-looking images with applications ranging from
virtual reality to data augmentation for training computer vision models. While
existing models are able to faithfully learn the image distribution of the
training set, they often lack controllability as they operate in 2D pixel space
and do not model the physical image formation process. In this work, we
investigate the importance of 3D reasoning for photorealistic rendering. We
present an approach for learning light transport in static and dynamic 3D
scenes using a neural network with the goal of predicting photorealistic
images. In contrast to existing approaches that operate in the 2D image domain,
our approach reasons in both 3D and 2D space, thus enabling global illumination
effects and manipulation of 3D scene geometry. Experimentally, we find that our
model is able to produce photorealistic renderings of static and dynamic
scenes. Moreover, it compares favorably to baselines which combine path tracing
and image denoising at the same computational budget.
Towards Unsupervised Learning of Generative Models for 3D Controllable Image Synthesis
In recent years, Generative Adversarial Networks have achieved impressive
results in photorealistic image synthesis. This progress nurtures hopes that
one day the classical rendering pipeline can be replaced by efficient models
that are learned directly from images. However, current image synthesis models
operate in the 2D domain where disentangling 3D properties such as camera
viewpoint or object pose is challenging. Furthermore, they lack an
interpretable and controllable representation. Our key hypothesis is that the
image generation process should be modeled in 3D space as the physical world
surrounding us is intrinsically three-dimensional. We define the new task of 3D
controllable image synthesis and propose an approach for solving it by
reasoning both in 3D space and in the 2D image domain. We demonstrate that our
model is able to disentangle latent 3D factors of simple multi-object scenes in
an unsupervised fashion from raw images. Compared to pure 2D baselines, it
allows for synthesizing scenes that are consistent with respect to changes in viewpoint or
object pose. We further evaluate various 3D representations in terms of their
usefulness for this challenging task.
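One reason reasoning in 3D makes viewpoint and pose controllable is that a viewpoint change is just a rigid transform of the 3D representation followed by a fixed camera projection. The pinhole-camera sketch below illustrates this; the intrinsics (focal length, principal point) are made-up values and the code is not the paper's model.

```python
import numpy as np

def project(points, R, t, f=100.0, cx=50.0, cy=50.0):
    """Project 3D points (N, 3) into a pinhole camera.

    R (3, 3), t (3,): world-to-camera rotation and translation,
    i.e. the controllable viewpoint.
    f, cx, cy: focal length and principal point (hypothetical intrinsics).
    """
    cam = points @ R.T + t           # rigid viewpoint transform in 3D
    z = cam[:, 2:3]
    uv = f * cam[:, :2] / z          # perspective division
    return uv + np.array([cx, cy])

# A point on the optical axis projects to the principal point.
p = np.array([[0.0, 0.0, 5.0]])
uv = project(p, np.eye(3), np.zeros(3))
```

Changing R or t moves every object consistently in the rendered 2D view, which is exactly the consistency under viewpoint changes that pure 2D models struggle to guarantee.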
Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision
Learning-based 3D reconstruction methods have shown impressive results.
However, most methods require 3D supervision which is often hard to obtain for
real-world datasets. Recently, several works have proposed differentiable
rendering techniques to train reconstruction models from RGB images.
Unfortunately, these approaches are currently restricted to voxel- and
mesh-based representations, suffering from discretization or low resolution. In
this work, we propose a differentiable rendering formulation for implicit shape
and texture representations. Implicit representations have recently gained
popularity as they represent shape and texture continuously. Our key insight is
that depth gradients can be derived analytically using the concept of implicit
differentiation. This allows us to learn implicit shape and texture
representations directly from RGB images. We experimentally show that our
single-view reconstructions rival those learned with full 3D supervision.
Moreover, we find that our method can be used for multi-view 3D reconstruction,
directly resulting in watertight meshes.
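The key insight above, that depth gradients follow analytically from implicit differentiation, can be illustrated on a sphere, where the surface is the zero set of a signed distance function f. If the ray o + t*d hits the surface at depth t, the implicit function theorem gives dt/dtheta = -(df/dtheta) / (grad_p f . d) for any shape parameter theta. The sketch below uses a sphere radius r as the parameter; the scene values are made up and this is not the paper's implementation.

```python
import numpy as np

c = np.array([0.0, 0.0, 5.0])    # sphere center (hypothetical scene)
o = np.zeros(3)                  # ray origin
d = np.array([0.0, 0.0, 1.0])    # unit ray direction

def f(p, r):
    """Signed distance of point p to a sphere of radius r."""
    return np.linalg.norm(p - c) - r

def depth(r, t=0.0):
    """Find the surface depth along the ray by sphere tracing."""
    for _ in range(100):
        t += f(o + t * d, r)     # the SDF value is a safe step size
    return t

def ddepth_dr(r):
    """Implicit differentiation: dt/dr = -(df/dr) / (grad_p f . d)."""
    t = depth(r)
    p = o + t * d
    grad_p = (p - c) / np.linalg.norm(p - c)   # spatial gradient of f
    df_dr = -1.0                               # f depends on r as (... - r)
    return -df_dr / (grad_p @ d)
```

For this ray the surface sits at t = 5 - r, so the analytic derivative dt/dr = -1, and it matches a finite-difference check of the root-finding routine, without ever differentiating through the iterations of the solver. That is what makes learning from RGB images tractable here.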